Chronic Kidney Disease (CKD), or chronic renal disease, has become a major health issue with a steadily growing incidence. Without functioning kidneys, a person can survive for only about 18 days on average, which creates a huge demand for kidney transplants and dialysis. Effective methods for the early prediction of CKD are therefore important, and machine learning methods are well suited to this task. This work proposes a workflow to predict CKD status from clinical data, incorporating data preprocessing, a missing-value handling method based on collaborative filtering, and attribute selection. Of the 11 machine learning methods considered, the extra tree classifier and the random forest classifier are shown to give the highest accuracy and minimal bias to the attributes. The research also considers the practical aspects of data collection and highlights the importance of incorporating domain knowledge when using machine learning for CKD status prediction.

The work identifies the limitations in handling missing values when analysing CKD data, proposes a new method to handle missing values, and presents an evaluation of different methods on the UCI dataset. Further, it highlights the importance of statistical analysis, as well as domain knowledge of the features, when making a prediction based on clinical data related to CKD.
Keywords: chronic renal disease, machine learning, classification algorithms, extra tree classifier, random forest classifier, XGBoost
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
kidney=pd.read_csv(r'E:\Python_ML_projects\Project 2-Chronic Kidney Disease\kidney_disease.csv')
kidney.shape
(400, 26)
kidney.head()
| | id | age | bp | sg | al | su | rbc | pc | pcc | ba | ... | pcv | wc | rc | htn | dm | cad | appet | pe | ane | classification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 48.0 | 80.0 | 1.020 | 1.0 | 0.0 | NaN | normal | notpresent | notpresent | ... | 44 | 7800 | 5.2 | yes | yes | no | good | no | no | ckd |
| 1 | 1 | 7.0 | 50.0 | 1.020 | 4.0 | 0.0 | NaN | normal | notpresent | notpresent | ... | 38 | 6000 | NaN | no | no | no | good | no | no | ckd |
| 2 | 2 | 62.0 | 80.0 | 1.010 | 2.0 | 3.0 | normal | normal | notpresent | notpresent | ... | 31 | 7500 | NaN | no | yes | no | poor | no | yes | ckd |
| 3 | 3 | 48.0 | 70.0 | 1.005 | 4.0 | 0.0 | normal | abnormal | present | notpresent | ... | 32 | 6700 | 3.9 | yes | no | no | poor | yes | yes | ckd |
| 4 | 4 | 51.0 | 80.0 | 1.010 | 2.0 | 0.0 | normal | normal | notpresent | notpresent | ... | 35 | 7300 | 4.6 | no | no | no | good | no | no | ckd |
5 rows × 26 columns
kidney.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 26 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   id              400 non-null    int64
 1   age             391 non-null    float64
 2   bp              388 non-null    float64
 3   sg              353 non-null    float64
 4   al              354 non-null    float64
 5   su              351 non-null    float64
 6   rbc             248 non-null    object
 7   pc              335 non-null    object
 8   pcc             396 non-null    object
 9   ba              396 non-null    object
 10  bgr             356 non-null    float64
 11  bu              381 non-null    float64
 12  sc              383 non-null    float64
 13  sod             313 non-null    float64
 14  pot             312 non-null    float64
 15  hemo            348 non-null    float64
 16  pcv             330 non-null    object
 17  wc              295 non-null    object
 18  rc              270 non-null    object
 19  htn             398 non-null    object
 20  dm              398 non-null    object
 21  cad             398 non-null    object
 22  appet           399 non-null    object
 23  pe              399 non-null    object
 24  ane             399 non-null    object
 25  classification  400 non-null    object
dtypes: float64(11), int64(1), object(14)
memory usage: 81.4+ KB
kidney.describe()
| | id | age | bp | sg | al | su | bgr | bu | sc | sod | pot | hemo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 400.000000 | 391.000000 | 388.000000 | 353.000000 | 354.000000 | 351.000000 | 356.000000 | 381.000000 | 383.000000 | 313.000000 | 312.000000 | 348.000000 |
| mean | 199.500000 | 51.483376 | 76.469072 | 1.017408 | 1.016949 | 0.450142 | 148.036517 | 57.425722 | 3.072454 | 137.528754 | 4.627244 | 12.526437 |
| std | 115.614301 | 17.169714 | 13.683637 | 0.005717 | 1.352679 | 1.099191 | 79.281714 | 50.503006 | 5.741126 | 10.408752 | 3.193904 | 2.912587 |
| min | 0.000000 | 2.000000 | 50.000000 | 1.005000 | 0.000000 | 0.000000 | 22.000000 | 1.500000 | 0.400000 | 4.500000 | 2.500000 | 3.100000 |
| 25% | 99.750000 | 42.000000 | 70.000000 | 1.010000 | 0.000000 | 0.000000 | 99.000000 | 27.000000 | 0.900000 | 135.000000 | 3.800000 | 10.300000 |
| 50% | 199.500000 | 55.000000 | 80.000000 | 1.020000 | 0.000000 | 0.000000 | 121.000000 | 42.000000 | 1.300000 | 138.000000 | 4.400000 | 12.650000 |
| 75% | 299.250000 | 64.500000 | 80.000000 | 1.020000 | 2.000000 | 0.000000 | 163.000000 | 66.000000 | 2.800000 | 142.000000 | 4.900000 | 15.000000 |
| max | 399.000000 | 90.000000 | 180.000000 | 1.025000 | 5.000000 | 5.000000 | 490.000000 | 391.000000 | 76.000000 | 163.000000 | 47.000000 | 17.800000 |
columns=pd.read_csv(r"E:\Python_ML_projects\Project 2-Chronic Kidney Disease\data_description.txt",sep='-')
columns=columns.reset_index()
columns.columns=['cols','abb_col_names']
columns
| | cols | abb_col_names |
|---|---|---|
| 0 | id | id |
| 1 | age | age |
| 2 | bp | blood pressure |
| 3 | sg | specific gravity |
| 4 | al | albumin |
| 5 | su | sugar |
| 6 | rbc | red blood cells |
| 7 | pc | pus cell |
| 8 | pcc | pus cell clumps |
| 9 | ba | bacteria |
| 10 | bgr | blood glucose random |
| 11 | bu | blood urea |
| 12 | sc | serum creatinine |
| 13 | sod | sodium |
| 14 | pot | potassium |
| 15 | hemo | haemoglobin |
| 16 | pcv | packed cell volume |
| 17 | wc | white blood cell count |
| 18 | rc | red blood cell count |
| 19 | htn | ypertension |
| 20 | dm | diabetes mellitus |
| 21 | cad | coronary artery disease |
| 22 | appet | appetite |
| 23 | pe | pedal edema |
| 24 | ane | anemia |
| 25 | classification | class |
kidney.columns=columns['abb_col_names'].values
kidney.head()
| | id | age | blood pressure | specific gravity | albumin | sugar | red blood cells | pus cell | pus cell clumps | bacteria | ... | packed cell volume | white blood cell count | red blood cell count | ypertension | diabetes mellitus | coronary artery disease | appetite | pedal edema | anemia | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 48.0 | 80.0 | 1.020 | 1.0 | 0.0 | NaN | normal | notpresent | notpresent | ... | 44 | 7800 | 5.2 | yes | yes | no | good | no | no | ckd |
| 1 | 1 | 7.0 | 50.0 | 1.020 | 4.0 | 0.0 | NaN | normal | notpresent | notpresent | ... | 38 | 6000 | NaN | no | no | no | good | no | no | ckd |
| 2 | 2 | 62.0 | 80.0 | 1.010 | 2.0 | 3.0 | normal | normal | notpresent | notpresent | ... | 31 | 7500 | NaN | no | yes | no | poor | no | yes | ckd |
| 3 | 3 | 48.0 | 70.0 | 1.005 | 4.0 | 0.0 | normal | abnormal | present | notpresent | ... | 32 | 6700 | 3.9 | yes | no | no | poor | yes | yes | ckd |
| 4 | 4 | 51.0 | 80.0 | 1.010 | 2.0 | 0.0 | normal | normal | notpresent | notpresent | ... | 35 | 7300 | 4.6 | no | no | no | good | no | no | ckd |
5 rows × 26 columns
kidney.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 400.0 | 199.500000 | 115.614301 | 0.000 | 99.75 | 199.50 | 299.25 | 399.000 |
| age | 391.0 | 51.483376 | 17.169714 | 2.000 | 42.00 | 55.00 | 64.50 | 90.000 |
| blood pressure | 388.0 | 76.469072 | 13.683637 | 50.000 | 70.00 | 80.00 | 80.00 | 180.000 |
| specific gravity | 353.0 | 1.017408 | 0.005717 | 1.005 | 1.01 | 1.02 | 1.02 | 1.025 |
| albumin | 354.0 | 1.016949 | 1.352679 | 0.000 | 0.00 | 0.00 | 2.00 | 5.000 |
| sugar | 351.0 | 0.450142 | 1.099191 | 0.000 | 0.00 | 0.00 | 0.00 | 5.000 |
| blood glucose random | 356.0 | 148.036517 | 79.281714 | 22.000 | 99.00 | 121.00 | 163.00 | 490.000 |
| blood urea | 381.0 | 57.425722 | 50.503006 | 1.500 | 27.00 | 42.00 | 66.00 | 391.000 |
| serum creatinine | 383.0 | 3.072454 | 5.741126 | 0.400 | 0.90 | 1.30 | 2.80 | 76.000 |
| sodium | 313.0 | 137.528754 | 10.408752 | 4.500 | 135.00 | 138.00 | 142.00 | 163.000 |
| potassium | 312.0 | 4.627244 | 3.193904 | 2.500 | 3.80 | 4.40 | 4.90 | 47.000 |
| haemoglobin | 348.0 | 12.526437 | 2.912587 | 3.100 | 10.30 | 12.65 | 15.00 | 17.800 |
def convert_dtype(kidney,feature):
    # errors='coerce' turns any value that cannot be parsed as a number into NaN
    kidney[feature]=pd.to_numeric(kidney[feature],errors='coerce')
features=['packed cell volume','white blood cell count','red blood cell count']
for i in features:
    convert_dtype(kidney,i)
kidney.dtypes
id                           int64
age                        float64
blood pressure             float64
specific gravity           float64
albumin                    float64
sugar                      float64
red blood cells             object
pus cell                    object
pus cell clumps             object
bacteria                    object
blood glucose random       float64
blood urea                 float64
serum creatinine           float64
sodium                     float64
potassium                  float64
haemoglobin                float64
packed cell volume         float64
white blood cell count     float64
red blood cell count       float64
ypertension                 object
diabetes mellitus           object
coronary artery disease     object
appetite                    object
pedal edema                 object
anemia                      object
class                       object
dtype: object
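To make the effect of `errors='coerce'` concrete, here is a minimal sketch on a few hypothetical string values (they are illustrative only and are not taken from the dataset). Anything that cannot be parsed as a number becomes NaN, which is why the converted columns can then be imputed like the other numerical columns.

```python
import pandas as pd

# Hypothetical raw values: numbers stored as strings plus unparseable placeholders
raw = pd.Series(['44', '38', '?', 'abc', None])

# errors='coerce' replaces every value that cannot be parsed with NaN
converted = pd.to_numeric(raw, errors='coerce')

print(converted)
print('unparseable or missing entries:', converted.isna().sum())
```

With the default `errors='raise'`, the same call would stop with an exception at the first unparseable string instead.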
kidney.drop('id',inplace=True,axis=1)
def extract_cat_num(kidney):
    cat_col=[col for col in kidney.columns if kidney[col].dtype=='O']
    num_col=[col for col in kidney.columns if kidney[col].dtype!='O']
    return cat_col,num_col
cat_col,num_col=extract_cat_num(kidney)
cat_col
['red blood cells', ' pus cell', 'pus cell clumps', 'bacteria', 'ypertension', 'diabetes mellitus', 'coronary artery disease', 'appetite', 'pedal edema', 'anemia', 'class']
num_col
['age', 'blood pressure', 'specific gravity', 'albumin', 'sugar', 'blood glucose random', 'blood urea', 'serum creatinine', 'sodium', 'potassium', 'haemoglobin', 'packed cell volume', 'white blood cell count', 'red blood cell count']
# dirtiness in categorical data
for col in cat_col:
    print('{} has {} values'.format(col,kidney[col].unique()))
    print("\n")
red blood cells has [nan 'normal' 'abnormal'] values
pus cell has ['normal' 'abnormal' nan] values
pus cell clumps has ['notpresent' 'present' nan] values
bacteria has ['notpresent' 'present' nan] values
ypertension has ['yes' 'no' nan] values
diabetes mellitus has ['yes' 'no' ' yes' '\tno' '\tyes' nan] values
coronary artery disease has ['no' 'yes' '\tno' nan] values
appetite has ['good' 'poor' nan] values
pedal edema has ['no' 'yes' nan] values
anemia has ['no' 'yes' nan] values
class has ['ckd' 'ckd\t' 'notckd'] values
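# assign the result back instead of calling replace(..., inplace=True) on a column selection
kidney['diabetes mellitus']=kidney['diabetes mellitus'].replace(to_replace={'\tno':'no','\tyes':'yes'})
kidney['coronary artery disease']=kidney['coronary artery disease'].replace(to_replace={'\tno':'no'})
kidney['class']=kidney['class'].replace(to_replace={'ckd\t':'ckd'})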
# re-check after cleaning: the tab-prefixed values are gone, but ' yes' (leading space) remains in 'diabetes mellitus'
for col in cat_col:
    print('{} has {} values'.format(col,kidney[col].unique()))
    print("\n")
red blood cells has [nan 'normal' 'abnormal'] values
pus cell has ['normal' 'abnormal' nan] values
pus cell clumps has ['notpresent' 'present' nan] values
bacteria has ['notpresent' 'present' nan] values
ypertension has ['yes' 'no' nan] values
diabetes mellitus has ['yes' 'no' ' yes' nan] values
coronary artery disease has ['no' 'yes' nan] values
appetite has ['good' 'poor' nan] values
pedal edema has ['no' 'yes' nan] values
anemia has ['no' 'yes' nan] values
class has ['ckd' 'notckd'] values
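The output above still lists `' yes'` (with a leading space) for diabetes mellitus, because only the tab-prefixed variants were replaced. A possible alternative, shown here as a sketch on a hypothetical toy frame rather than on the notebook's own `kidney` data, is to strip surrounding whitespace from every categorical column instead of listing each dirty variant by hand.

```python
import pandas as pd

# Hypothetical toy data that mimics the dirty labels seen above
toy = pd.DataFrame({
    'diabetes mellitus': ['yes', 'no', ' yes', '\tno', '\tyes', None],
    'class': ['ckd', 'ckd\t', 'notckd', 'ckd', 'notckd', 'ckd'],
})

for col in toy.columns:
    # .str.strip() removes leading/trailing spaces and tabs; missing values stay missing
    toy[col] = toy[col].str.strip()

print(toy['diabetes mellitus'].unique())
print(toy['class'].unique())
```

Applied to the real data, the same loop over `cat_col` would be expected to leave two categories for diabetes mellitus instead of three, which in turn would change the label-encoded values later in the notebook.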
len(num_col)
14
plt.figure(figsize=(30,30))
for i,feature in enumerate(num_col):
    plt.subplot(5,3,i+1) # 5 rows and 3 columns
    kidney[feature].hist()
    plt.title(feature)
len(cat_col)
11
plt.figure(figsize=(20,20))
for i,feature in enumerate(cat_col):
    plt.subplot(4,3,i+1)
    sns.countplot(x=kidney[feature]) # pass the column as x explicitly
plt.figure(figsize=(20,20))
for i,feature in enumerate(cat_col):
    plt.subplot(4,3,i+1)
    sns.countplot(x=kidney[feature],hue=kidney['class'])
sns.countplot(x=kidney['class'])
<AxesSubplot:xlabel='class', ylabel='count'>
kidney[num_col].corr() # restrict to the numerical columns; the object columns cannot be correlated
| | age | blood pressure | specific gravity | albumin | sugar | blood glucose random | blood urea | serum creatinine | sodium | potassium | haemoglobin | packed cell volume | white blood cell count | red blood cell count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| age | 1.000000 | 0.159480 | -0.191096 | 0.122091 | 0.220866 | 0.244992 | 0.196985 | 0.132531 | -0.100046 | 0.058377 | -0.192928 | -0.242119 | 0.118339 | -0.268896 |
| blood pressure | 0.159480 | 1.000000 | -0.218836 | 0.160689 | 0.222576 | 0.160193 | 0.188517 | 0.146222 | -0.116422 | 0.075151 | -0.306540 | -0.326319 | 0.029753 | -0.261936 |
| specific gravity | -0.191096 | -0.218836 | 1.000000 | -0.469760 | -0.296234 | -0.374710 | -0.314295 | -0.361473 | 0.412190 | -0.072787 | 0.602582 | 0.603560 | -0.236215 | 0.579476 |
| albumin | 0.122091 | 0.160689 | -0.469760 | 1.000000 | 0.269305 | 0.379464 | 0.453528 | 0.399198 | -0.459896 | 0.129038 | -0.634632 | -0.611891 | 0.231989 | -0.566437 |
| sugar | 0.220866 | 0.222576 | -0.296234 | 0.269305 | 1.000000 | 0.717827 | 0.168583 | 0.223244 | -0.131776 | 0.219450 | -0.224775 | -0.239189 | 0.184893 | -0.237448 |
| blood glucose random | 0.244992 | 0.160193 | -0.374710 | 0.379464 | 0.717827 | 1.000000 | 0.143322 | 0.114875 | -0.267848 | 0.066966 | -0.306189 | -0.301385 | 0.150015 | -0.281541 |
| blood urea | 0.196985 | 0.188517 | -0.314295 | 0.453528 | 0.168583 | 0.143322 | 1.000000 | 0.586368 | -0.323054 | 0.357049 | -0.610360 | -0.607621 | 0.050462 | -0.579087 |
| serum creatinine | 0.132531 | 0.146222 | -0.361473 | 0.399198 | 0.223244 | 0.114875 | 0.586368 | 1.000000 | -0.690158 | 0.326107 | -0.401670 | -0.404193 | -0.006390 | -0.400852 |
| sodium | -0.100046 | -0.116422 | 0.412190 | -0.459896 | -0.131776 | -0.267848 | -0.323054 | -0.690158 | 1.000000 | 0.097887 | 0.365183 | 0.376914 | 0.007277 | 0.344873 |
| potassium | 0.058377 | 0.075151 | -0.072787 | 0.129038 | 0.219450 | 0.066966 | 0.357049 | 0.326107 | 0.097887 | 1.000000 | -0.133746 | -0.163182 | -0.105576 | -0.158309 |
| haemoglobin | -0.192928 | -0.306540 | 0.602582 | -0.634632 | -0.224775 | -0.306189 | -0.610360 | -0.401670 | 0.365183 | -0.133746 | 1.000000 | 0.895382 | -0.169413 | 0.798880 |
| packed cell volume | -0.242119 | -0.326319 | 0.603560 | -0.611891 | -0.239189 | -0.301385 | -0.607621 | -0.404193 | 0.376914 | -0.163182 | 0.895382 | 1.000000 | -0.197022 | 0.791625 |
| white blood cell count | 0.118339 | 0.029753 | -0.236215 | 0.231989 | 0.184893 | 0.150015 | 0.050462 | -0.006390 | 0.007277 | -0.105576 | -0.169413 | -0.197022 | 1.000000 | -0.158163 |
| red blood cell count | -0.268896 | -0.261936 | 0.579476 | -0.566437 | -0.237448 | -0.281541 | -0.579087 | -0.400852 | 0.344873 | -0.158309 | 0.798880 | 0.791625 | -0.158163 | 1.000000 |
plt.figure(figsize=(12,12))
sns.heatmap(kidney[num_col].corr(method='pearson'),cbar=True,cmap='BuPu',annot=True)
<AxesSubplot:>
kidney.groupby(['red blood cells','class'])['red blood cell count'].agg(['count','mean','median','min','max'])
| red blood cells | class | count | mean | median | min | max |
|---|---|---|---|---|---|---|
| abnormal | ckd | 25 | 3.832000 | 3.7 | 2.5 | 5.6 |
| normal | ckd | 40 | 3.782500 | 3.8 | 2.1 | 8.0 |
| normal | notckd | 134 | 5.368657 | 5.3 | 4.4 | 6.5 |
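Here count is the number of records in each group, not a blood measurement. The 134 non-CKD records have a mean red blood cell count of about 5.37, whereas the CKD records (25 with abnormal and 40 with normal red blood cells) have a clearly lower mean of about 3.8.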
!pip install plotly
import plotly.express as px
px.violin( kidney ,y='red blood cell count', x='class', color='class')
plt.figure(figsize=(10,10))
plt.scatter(x=kidney.haemoglobin,y=kidney['packed cell volume'])
plt.xlabel('Haemoglobin')
plt.ylabel('packed cell volume')
plt.title('Relationship between haemoglobin and packed cell volume')
Text(0.5, 1.0, 'Relationship between haemoglobin and packed cell volume')
We can see that there is a roughly linear relationship between haemoglobin and packed cell volume.
grid=sns.FacetGrid(kidney,hue='class',aspect=2)
grid.map(sns.kdeplot,'red blood cell count')
grid.add_legend()
<seaborn.axisgrid.FacetGrid at 0x1d39258b370>
From the visuals above, we can say that a person with a lower red blood cell count has a higher chance of having chronic kidney disease.
grid=sns.FacetGrid(kidney,hue='class',aspect=2)
grid.map(sns.kdeplot,'haemoglobin')
grid.add_legend()
<seaborn.axisgrid.FacetGrid at 0x1d39269c820>
To avoid repeating the same plotting code, we can wrap it in small helper functions and call those instead.
def violin(col):
    fig = px.violin(kidney,y=col,x='class',color='class',box=True)
    return fig.show()
def scatters(col1,col2):
    fig = px.scatter(kidney,x=col1,y=col2,color='class')
    return fig.show()
def kde_plot(feature):
    grid=sns.FacetGrid(kidney,hue='class',aspect=2)
    grid.map(sns.kdeplot,feature)
    grid.add_legend()
kde_plot('red blood cell count')
kde_plot('haemoglobin')
scatters('red blood cell count','packed cell volume')
plt.figure(figsize=(12,10))
sns.scatterplot(x=kidney['red blood cell count'],y=kidney['packed cell volume'],hue=kidney['class'])
plt.xlabel('red blood cell count')
plt.ylabel('packed cell volume')
plt.title('Relationship between red blood cell count and packed cell volume')
Text(0.5, 1.0, 'Relationship between red blood cell count and packed cell volume')
plt.figure(figsize=(12,10))
sns.scatterplot(x=kidney['red blood cell count'],y=kidney['haemoglobin'],hue=kidney['class'])
plt.xlabel('red blood cell count')
plt.ylabel('haemoglobin')
plt.title('Relationship between haemoglobin and red blood cell count')
Text(0.5, 1.0, 'Relationship between haemoglobin and red blood cell count')
violin('red blood cell count')
scatters('red blood cell count','albumin')
kidney.isnull().sum()
age                          9
blood pressure              12
specific gravity            47
albumin                     46
sugar                       49
red blood cells            152
pus cell                    65
pus cell clumps              4
bacteria                     4
blood glucose random        44
blood urea                  19
serum creatinine            17
sodium                      87
potassium                   88
haemoglobin                 52
packed cell volume          71
white blood cell count     106
red blood cell count       131
ypertension                  2
diabetes mellitus            2
coronary artery disease      2
appetite                     1
pedal edema                  1
anemia                       1
class                        0
dtype: int64
kidney.isnull().sum().sort_values(ascending=False)
red blood cells            152
red blood cell count       131
white blood cell count     106
potassium                   88
sodium                      87
packed cell volume          71
pus cell                    65
haemoglobin                 52
sugar                       49
specific gravity            47
albumin                     46
blood glucose random        44
blood urea                  19
serum creatinine            17
blood pressure              12
age                          9
bacteria                     4
pus cell clumps              4
ypertension                  2
diabetes mellitus            2
coronary artery disease      2
appetite                     1
pedal edema                  1
anemia                       1
class                        0
dtype: int64
We can fill these missing values with a summary statistic of each column, such as the mean or the median.
plt.subplot(1,2,1)
sns.boxplot(x=kidney['class'],y=kidney['age'])
<AxesSubplot:xlabel='class', ylabel='age'>
list(enumerate(cat_col))
[(0, 'red blood cells'), (1, ' pus cell'), (2, 'pus cell clumps'), (3, 'bacteria'), (4, 'ypertension'), (5, 'diabetes mellitus'), (6, 'coronary artery disease'), (7, 'appetite'), (8, 'pedal edema'), (9, 'anemia'), (10, 'class')]
plt.figure(figsize=(15,15))
for i,feature in enumerate(num_col):
    plt.subplot(4,4,i+1)
    sns.boxplot(x='class',y=feature,data=kidney) # one box plot per numerical feature, split by class
There are outliers in the dataset, so filling missing values with the mean is not a good choice because the mean is pulled towards the outliers. I will use the median to fill the missing values instead.
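The reasoning can be checked on a tiny example. The sketch below uses hypothetical values (only the extreme value 76.0 echoes the maximum serum creatinine reported by `describe()` earlier) and compares how far a single outlier moves the mean and the median.

```python
import numpy as np

# Hypothetical measurements in a narrow range
values = np.array([0.9, 1.1, 1.2, 1.3, 1.4])

# The same measurements with one extreme value appended
with_outlier = np.append(values, 76.0)

print('without outlier: mean =', values.mean(), 'median =', np.median(values))
print('with outlier:    mean =', with_outlier.mean(), 'median =', np.median(with_outlier))
```

The mean jumps by an order of magnitude while the median barely moves, so a median fill value stays close to typical observations.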
kidney[num_col].mean() # column means of the numerical features only
age                         51.483376
blood pressure              76.469072
specific gravity             1.017408
albumin                      1.016949
sugar                        0.450142
blood glucose random       148.036517
blood urea                  57.425722
serum creatinine             3.072454
sodium                     137.528754
potassium                    4.627244
haemoglobin                 12.526437
packed cell volume          38.884498
white blood cell count    8406.122449
red blood cell count         4.707435
dtype: float64
for i in num_col:
    # assign back rather than using inplace=True on a column selection
    kidney[i]=kidney[i].fillna(kidney[i].median())
kidney.isnull().sum()
age                          0
blood pressure               0
specific gravity             0
albumin                      0
sugar                        0
red blood cells            152
pus cell                    65
pus cell clumps              4
bacteria                     4
blood glucose random         0
blood urea                   0
serum creatinine             0
sodium                       0
potassium                    0
haemoglobin                  0
packed cell volume           0
white blood cell count       0
red blood cell count         0
ypertension                  2
diabetes mellitus            2
coronary artery disease      2
appetite                     1
pedal edema                  1
anemia                       1
class                        0
dtype: int64
kidney.describe()
| | age | blood pressure | specific gravity | albumin | sugar | blood glucose random | blood urea | serum creatinine | sodium | potassium | haemoglobin | packed cell volume | white blood cell count | red blood cell count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 400.000000 | 400.000000 | 400.000000 | 400.00000 | 400.000000 | 400.000000 | 400.000000 | 400.000000 | 400.000000 | 400.000000 | 400.00000 | 400.000000 | 400.000000 | 400.000000 |
| mean | 51.562500 | 76.575000 | 1.017712 | 0.90000 | 0.395000 | 145.062500 | 56.693000 | 2.997125 | 137.631250 | 4.577250 | 12.54250 | 39.082500 | 8298.500000 | 4.737750 |
| std | 16.982996 | 13.489785 | 0.005434 | 1.31313 | 1.040038 | 75.260774 | 49.395258 | 5.628886 | 9.206332 | 2.821357 | 2.71649 | 8.162245 | 2529.593814 | 0.841439 |
| min | 2.000000 | 50.000000 | 1.005000 | 0.00000 | 0.000000 | 22.000000 | 1.500000 | 0.400000 | 4.500000 | 2.500000 | 3.10000 | 9.000000 | 2200.000000 | 2.100000 |
| 25% | 42.000000 | 70.000000 | 1.015000 | 0.00000 | 0.000000 | 101.000000 | 27.000000 | 0.900000 | 135.000000 | 4.000000 | 10.87500 | 34.000000 | 6975.000000 | 4.500000 |
| 50% | 55.000000 | 80.000000 | 1.020000 | 0.00000 | 0.000000 | 121.000000 | 42.000000 | 1.300000 | 138.000000 | 4.400000 | 12.65000 | 40.000000 | 8000.000000 | 4.800000 |
| 75% | 64.000000 | 80.000000 | 1.020000 | 2.00000 | 0.000000 | 150.000000 | 61.750000 | 2.725000 | 141.000000 | 4.800000 | 14.62500 | 44.000000 | 9400.000000 | 5.100000 |
| max | 90.000000 | 180.000000 | 1.025000 | 5.00000 | 5.000000 | 490.000000 | 391.000000 | 76.000000 | 163.000000 | 47.000000 | 17.80000 | 54.000000 | 26400.000000 | 8.000000 |
It is important to find the missing values and clean them with a suitable method. Here they are imputed rather than dropped: the numerical columns with the median, and the categorical columns with random sampling or the mode. Missing values can lead to misleading output and can cause problems when evaluating the model.

kidney['red blood cells'].isnull().sum()
152
random_sample=kidney['red blood cells'].dropna().sample(152)
random_sample
300 normal
176 normal
296 normal
276 normal
221 normal
...
27 normal
54 normal
302 normal
361 normal
371 normal
Name: red blood cells, Length: 152, dtype: object
kidney[kidney['red blood cells'].isnull()].index
Int64Index([ 0, 1, 5, 6, 10, 12, 13, 15, 16, 17,
...
245, 268, 280, 290, 295, 309, 322, 349, 350, 381],
dtype='int64', length=152)
random_sample.index
Int64Index([300, 176, 296, 276, 221, 348, 230, 390, 116, 363,
...
368, 327, 278, 285, 154, 27, 54, 302, 361, 371],
dtype='int64', length=152)
We can see that the two indexes are different. When assigning the random values, the index of the sample must match the index of the missing rows.
random_sample.index=kidney[kidney['red blood cells'].isnull()].index # relabel the sample so it lines up with the missing rows
random_sample.index
Int64Index([ 0, 1, 5, 6, 10, 12, 13, 15, 16, 17,
...
245, 268, 280, 290, 295, 309, 322, 349, 350, 381],
dtype='int64', length=152)
kidney.loc[kidney['red blood cells'].isnull(),'red blood cells']=random_sample
kidney.head()
| | age | blood pressure | specific gravity | albumin | sugar | red blood cells | pus cell | pus cell clumps | bacteria | blood glucose random | ... | packed cell volume | white blood cell count | red blood cell count | ypertension | diabetes mellitus | coronary artery disease | appetite | pedal edema | anemia | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 48.0 | 80.0 | 1.020 | 1.0 | 0.0 | normal | normal | notpresent | notpresent | 121.0 | ... | 44.0 | 7800.0 | 5.2 | yes | yes | no | good | no | no | ckd |
| 1 | 7.0 | 50.0 | 1.020 | 4.0 | 0.0 | normal | normal | notpresent | notpresent | 121.0 | ... | 38.0 | 6000.0 | 4.8 | no | no | no | good | no | no | ckd |
| 2 | 62.0 | 80.0 | 1.010 | 2.0 | 3.0 | normal | normal | notpresent | notpresent | 423.0 | ... | 31.0 | 7500.0 | 4.8 | no | yes | no | poor | no | yes | ckd |
| 3 | 48.0 | 70.0 | 1.005 | 4.0 | 0.0 | normal | abnormal | present | notpresent | 117.0 | ... | 32.0 | 6700.0 | 3.9 | yes | no | no | poor | yes | yes | ckd |
| 4 | 51.0 | 80.0 | 1.010 | 2.0 | 0.0 | normal | normal | notpresent | notpresent | 106.0 | ... | 35.0 | 7300.0 | 4.6 | no | no | no | good | no | no | ckd |
5 rows × 25 columns
kidney['red blood cells'].isnull().sum()
0
sns.countplot(x=kidney['red blood cells']) # checking that the ratio did not change after filling missing values
<AxesSubplot:xlabel='red blood cells', ylabel='count'>
The ratio of normal to abnormal values did not change noticeably.
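The whole random-sample imputation step can be sketched end to end on a small hypothetical column. The frame `df`, the column values, and the `random_state` argument are assumptions added for illustration and reproducibility; the notebook's own code samples without a fixed seed.

```python
import pandas as pd

# Hypothetical toy column with two missing entries
df = pd.DataFrame({'red blood cells': ['normal', None, 'abnormal', 'normal', None, 'normal']})
col = 'red blood cells'

missing = df[col].isnull()

# Draw as many observed values as there are gaps (random_state is an added assumption)
sample = df[col].dropna().sample(int(missing.sum()), random_state=0)

# Relabel the sample with the index of the missing rows so the assignment aligns
sample.index = df.index[missing]
df.loc[missing, col] = sample

print(df[col].value_counts())
```

Because the fill values are drawn from the observed values, the imputed column tends to keep roughly the same category proportions, which is what the count plot above is checking.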
# fill missing values of a categorical column with values sampled at random from its observed values
def Random_value_Imputation(feature):
    random_sample=kidney[feature].dropna().sample(kidney[feature].isnull().sum())
    random_sample.index=kidney[kidney[feature].isnull()].index
    kidney.loc[kidney[feature].isnull(),feature]=random_sample
Random_value_Imputation(' pus cell') # only this column, because it has a higher number of missing values
kidney.isnull().sum()
age                        0
blood pressure             0
specific gravity           0
albumin                    0
sugar                      0
red blood cells            0
pus cell                   0
pus cell clumps            4
bacteria                   4
blood glucose random       0
blood urea                 0
serum creatinine           0
sodium                     0
potassium                  0
haemoglobin                0
packed cell volume         0
white blood cell count     0
red blood cell count       0
ypertension                2
diabetes mellitus          2
coronary artery disease    2
appetite                   1
pedal edema                1
anemia                     1
class                      0
dtype: int64
For the categorical variables that have only a few missing values, we can replace the missing values with the mode.
def impute_mode(feature):
    mode=kidney[feature].mode()[0]
    kidney[feature]=kidney[feature].fillna(mode)
for col in cat_col:
    impute_mode(col)
kidney[cat_col].isnull().sum()
red blood cells            0
pus cell                   0
pus cell clumps            0
bacteria                   0
ypertension                0
diabetes mellitus          0
coronary artery disease    0
appetite                   0
pedal edema                0
anemia                     0
class                      0
dtype: int64
kidney.isnull().sum()
age                        0
blood pressure             0
specific gravity           0
albumin                    0
sugar                      0
red blood cells            0
pus cell                   0
pus cell clumps            0
bacteria                   0
blood glucose random       0
blood urea                 0
serum creatinine           0
sodium                     0
potassium                  0
haemoglobin                0
packed cell volume         0
white blood cell count     0
red blood cell count       0
ypertension                0
diabetes mellitus          0
coronary artery disease    0
appetite                   0
pedal edema                0
anemia                     0
class                      0
dtype: int64
We can see that there are no missing values now.
Machine learning models can only work with numerical values. For this reason, it is necessary to transform the categorical values of the relevant features into numerical ones. This process is called feature encoding.
for col in cat_col:
    print('{} has {} categories'.format(col,kidney[col].nunique()))
red blood cells has 2 categories
pus cell has 2 categories
pus cell clumps has 2 categories
bacteria has 2 categories
ypertension has 2 categories
diabetes mellitus has 3 categories
coronary artery disease has 2 categories
appetite has 2 categories
pedal edema has 2 categories
anemia has 2 categories
class has 2 categories
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
for col in cat_col:
    kidney[col]=le.fit_transform(kidney[col])
kidney.head()
| | age | blood pressure | specific gravity | albumin | sugar | red blood cells | pus cell | pus cell clumps | bacteria | blood glucose random | ... | packed cell volume | white blood cell count | red blood cell count | ypertension | diabetes mellitus | coronary artery disease | appetite | pedal edema | anemia | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 48.0 | 80.0 | 1.020 | 1.0 | 0.0 | 1 | 1 | 0 | 0 | 121.0 | ... | 44.0 | 7800.0 | 5.2 | 1 | 2 | 0 | 0 | 0 | 0 | 0 |
| 1 | 7.0 | 50.0 | 1.020 | 4.0 | 0.0 | 1 | 1 | 0 | 0 | 121.0 | ... | 38.0 | 6000.0 | 4.8 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 62.0 | 80.0 | 1.010 | 2.0 | 3.0 | 1 | 1 | 0 | 0 | 423.0 | ... | 31.0 | 7500.0 | 4.8 | 0 | 2 | 0 | 1 | 0 | 1 | 0 |
| 3 | 48.0 | 70.0 | 1.005 | 4.0 | 0.0 | 1 | 0 | 1 | 0 | 117.0 | ... | 32.0 | 6700.0 | 3.9 | 1 | 1 | 0 | 1 | 1 | 1 | 0 |
| 4 | 51.0 | 80.0 | 1.010 | 2.0 | 0.0 | 1 | 1 | 0 | 0 | 106.0 | ... | 35.0 | 7300.0 | 4.6 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
5 rows × 25 columns
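To read the encoded table, it helps to know which integer each label received. `LabelEncoder` sorts the distinct labels and numbers them from 0, so the mapping can be recovered from its `classes_` attribute. The sketch below uses small hypothetical label lists; the second one mimics the three diabetes mellitus categories reported above, where a stray `' yes'` (leading space) is treated as its own category.

```python
from sklearn.preprocessing import LabelEncoder

# Two clean labels: sorted order gives ckd -> 0, notckd -> 1
le_class = LabelEncoder()
codes = le_class.fit_transform(['ckd', 'notckd', 'ckd', 'notckd'])
class_mapping = {label: int(code) for label, code in
                 zip(le_class.classes_, le_class.transform(le_class.classes_))}
print(class_mapping)

# A label with a leading space becomes a separate category and shifts the other codes
le_dm = LabelEncoder()
le_dm.fit(['yes', 'no', ' yes'])
dm_mapping = {label: int(code) for label, code in
              zip(le_dm.classes_, le_dm.transform(le_dm.classes_))}
print(dm_mapping)
```

This is consistent with the table above, where class is 0 for the ckd rows and diabetes mellitus takes the values 1 and 2 rather than 0 and 1.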
SelectKBest: Feature selection is a technique where we choose the features in our data that contribute most to the target variable. In other words, we choose the best predictors for the target variable. SelectKBest, from the sklearn.feature_selection module, keeps the k features with the highest scores under a given scoring function.
chi2: A chi-square (χ2) statistic is a test that measures how a model compares to actual observed data. ... The chi-square statistic compares the size of any discrepancies between the expected results and the actual results, given the size of the sample and the number of variables in the relationship.
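As a minimal sketch of how these two pieces fit together, the example below builds a hypothetical two-feature dataset (one feature tied to the class, one pure noise) and lets `SelectKBest` with `chi2` keep the better one. The synthetic data and variable names are assumptions for illustration only. Note that scikit-learn's `chi2` expects non-negative feature values, and its scores depend on the magnitude of each feature, so features on large scales can receive large scores.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)

# Hypothetical binary target: 50 samples per class
y_toy = np.array([0] * 50 + [1] * 50)

# Feature 0 depends on the class; feature 1 is unrelated noise (both non-negative)
informative = np.where(y_toy == 1, 10, 1) + rng.integers(0, 3, size=100)
noise = rng.integers(0, 10, size=100)
X_toy = np.column_stack([informative, noise])

# Keep only the single best-scoring feature
selector = SelectKBest(score_func=chi2, k=1).fit(X_toy, y_toy)

print('scores:', selector.scores_)
print('selected mask:', selector.get_support())
```

The informative feature receives a much larger score and is the one retained, which is the same ranking logic applied to the 24 CKD features below.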

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
ind_col=[col for col in kidney.columns if col!='class']
dep_col='class'
X=kidney[ind_col]
y=kidney[dep_col]
X.head()
| | age | blood pressure | specific gravity | albumin | sugar | red blood cells | pus cell | pus cell clumps | bacteria | blood glucose random | ... | haemoglobin | packed cell volume | white blood cell count | red blood cell count | ypertension | diabetes mellitus | coronary artery disease | appetite | pedal edema | anemia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 48.0 | 80.0 | 1.020 | 1.0 | 0.0 | 1 | 1 | 0 | 0 | 121.0 | ... | 15.4 | 44.0 | 7800.0 | 5.2 | 1 | 2 | 0 | 0 | 0 | 0 |
| 1 | 7.0 | 50.0 | 1.020 | 4.0 | 0.0 | 1 | 1 | 0 | 0 | 121.0 | ... | 11.3 | 38.0 | 6000.0 | 4.8 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 62.0 | 80.0 | 1.010 | 2.0 | 3.0 | 1 | 1 | 0 | 0 | 423.0 | ... | 9.6 | 31.0 | 7500.0 | 4.8 | 0 | 2 | 0 | 1 | 0 | 1 |
| 3 | 48.0 | 70.0 | 1.005 | 4.0 | 0.0 | 1 | 0 | 1 | 0 | 117.0 | ... | 11.2 | 32.0 | 6700.0 | 3.9 | 1 | 1 | 0 | 1 | 1 | 1 |
| 4 | 51.0 | 80.0 | 1.010 | 2.0 | 0.0 | 1 | 1 | 0 | 0 | 106.0 | ... | 11.6 | 35.0 | 7300.0 | 4.6 | 0 | 1 | 0 | 0 | 0 | 0 |
5 rows × 24 columns
imp_features=SelectKBest(score_func=chi2,k=20)
imp_features=imp_features.fit(X,y)
imp_features
SelectKBest(k=20, score_func=<function chi2 at 0x000001D391B14040>)
imp_features.scores_
array([1.15859940e+02, 8.17867015e+01, 5.03531613e-03, 2.16000000e+02,
9.48000000e+01, 1.02639835e+01, 1.38744086e+01, 2.52000000e+01,
1.32000000e+01, 2.24165129e+03, 2.34309714e+03, 3.57792101e+02,
2.75587488e+01, 2.95133869e+00, 1.23856342e+02, 3.08181415e+02,
9.70105039e+03, 1.91130252e+01, 8.82000000e+01, 2.04392523e+01,
2.04000000e+01, 4.92000000e+01, 4.56000000e+01, 3.60000000e+01])
datascore=pd.DataFrame(imp_features.scores_,columns=['Score'])
datascore
| | Score |
|---|---|
| 0 | 115.859940 |
| 1 | 81.786701 |
| 2 | 0.005035 |
| 3 | 216.000000 |
| 4 | 94.800000 |
| 5 | 10.263983 |
| 6 | 13.874409 |
| 7 | 25.200000 |
| 8 | 13.200000 |
| 9 | 2241.651289 |
| 10 | 2343.097145 |
| 11 | 357.792101 |
| 12 | 27.558749 |
| 13 | 2.951339 |
| 14 | 123.856342 |
| 15 | 308.181415 |
| 16 | 9701.050391 |
| 17 | 19.113025 |
| 18 | 88.200000 |
| 19 | 20.439252 |
| 20 | 20.400000 |
| 21 | 49.200000 |
| 22 | 45.600000 |
| 23 | 36.000000 |
X.columns
Index(['age', 'blood pressure', 'specific gravity', 'albumin', 'sugar',
'red blood cells', ' pus cell', 'pus cell clumps', 'bacteria',
'blood glucose random', 'blood urea', 'serum creatinine', 'sodium',
'potassium', 'haemoglobin', 'packed cell volume',
'white blood cell count', 'red blood cell count', 'ypertension',
'diabetes mellitus', 'coronary artery disease', 'appetite',
'pedal edema', 'anemia'],
dtype='object')
dfcols=pd.DataFrame(X.columns)
dfcols
| | 0 |
|---|---|
| 0 | age |
| 1 | blood pressure |
| 2 | specific gravity |
| 3 | albumin |
| 4 | sugar |
| 5 | red blood cells |
| 6 | pus cell |
| 7 | pus cell clumps |
| 8 | bacteria |
| 9 | blood glucose random |
| 10 | blood urea |
| 11 | serum creatinine |
| 12 | sodium |
| 13 | potassium |
| 14 | haemoglobin |
| 15 | packed cell volume |
| 16 | white blood cell count |
| 17 | red blood cell count |
| 18 | ypertension |
| 19 | diabetes mellitus |
| 20 | coronary artery disease |
| 21 | appetite |
| 22 | pedal edema |
| 23 | anemia |
features_rank=pd.concat([dfcols,datascore],axis=1)
features_rank
| | 0 | Score |
|---|---|---|
| 0 | age | 115.859940 |
| 1 | blood pressure | 81.786701 |
| 2 | specific gravity | 0.005035 |
| 3 | albumin | 216.000000 |
| 4 | sugar | 94.800000 |
| 5 | red blood cells | 10.263983 |
| 6 | pus cell | 13.874409 |
| 7 | pus cell clumps | 25.200000 |
| 8 | bacteria | 13.200000 |
| 9 | blood glucose random | 2241.651289 |
| 10 | blood urea | 2343.097145 |
| 11 | serum creatinine | 357.792101 |
| 12 | sodium | 27.558749 |
| 13 | potassium | 2.951339 |
| 14 | haemoglobin | 123.856342 |
| 15 | packed cell volume | 308.181415 |
| 16 | white blood cell count | 9701.050391 |
| 17 | red blood cell count | 19.113025 |
| 18 | ypertension | 88.200000 |
| 19 | diabetes mellitus | 20.439252 |
| 20 | coronary artery disease | 20.400000 |
| 21 | appetite | 49.200000 |
| 22 | pedal edema | 45.600000 |
| 23 | anemia | 36.000000 |
features_rank.columns=['features','score']
features_rank
| | features | score |
|---|---|---|
| 0 | age | 115.859940 |
| 1 | blood pressure | 81.786701 |
| 2 | specific gravity | 0.005035 |
| 3 | albumin | 216.000000 |
| 4 | sugar | 94.800000 |
| 5 | red blood cells | 10.263983 |
| 6 | pus cell | 13.874409 |
| 7 | pus cell clumps | 25.200000 |
| 8 | bacteria | 13.200000 |
| 9 | blood glucose random | 2241.651289 |
| 10 | blood urea | 2343.097145 |
| 11 | serum creatinine | 357.792101 |
| 12 | sodium | 27.558749 |
| 13 | potassium | 2.951339 |
| 14 | haemoglobin | 123.856342 |
| 15 | packed cell volume | 308.181415 |
| 16 | white blood cell count | 9701.050391 |
| 17 | red blood cell count | 19.113025 |
| 18 | ypertension | 88.200000 |
| 19 | diabetes mellitus | 20.439252 |
| 20 | coronary artery disease | 20.400000 |
| 21 | appetite | 49.200000 |
| 22 | pedal edema | 45.600000 |
| 23 | anemia | 36.000000 |
features_rank.nlargest(10,'score')
| | features | score |
|---|---|---|
| 16 | white blood cell count | 9701.050391 |
| 10 | blood urea | 2343.097145 |
| 9 | blood glucose random | 2241.651289 |
| 11 | serum creatinine | 357.792101 |
| 15 | packed cell volume | 308.181415 |
| 3 | albumin | 216.000000 |
| 14 | haemoglobin | 123.856342 |
| 0 | age | 115.859940 |
| 4 | sugar | 94.800000 |
| 18 | ypertension | 88.200000 |
selected=features_rank.nlargest(10,'score')['features'].values
selected
array(['white blood cell count', 'blood urea', 'blood glucose random',
'serum creatinine', 'packed cell volume', 'albumin', 'haemoglobin',
'age', 'sugar', 'ypertension'], dtype=object)
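The ranking steps above (build a score table, attach the column names, take the ten largest) can also be sketched in a few lines of plain Python. The names and scores below are copied from the notebook outputs, with scores rounded as displayed; the variable names `feature_names`, `chi2_values` and `top10` are hypothetical.

```python
# Feature names and chi-square scores copied from the outputs above.
feature_names = [
    "age", "blood pressure", "specific gravity", "albumin", "sugar",
    "red blood cells", "pus cell", "pus cell clumps", "bacteria",
    "blood glucose random", "blood urea", "serum creatinine", "sodium",
    "potassium", "haemoglobin", "packed cell volume",
    "white blood cell count", "red blood cell count", "ypertension",
    "diabetes mellitus", "coronary artery disease", "appetite",
    "pedal edema", "anemia",
]
chi2_values = [
    115.859940, 81.786701, 0.005035, 216.0, 94.8, 10.263983, 13.874409,
    25.2, 13.2, 2241.651289, 2343.097145, 357.792101, 27.558749, 2.951339,
    123.856342, 308.181415, 9701.050391, 19.113025, 88.2, 20.439252,
    20.4, 49.2, 45.6, 36.0,
]

# Pair each name with its score and sort by score, highest first.
ranked = sorted(zip(feature_names, chi2_values), key=lambda pair: pair[1], reverse=True)
top10 = [name for name, _ in ranked[:10]]

for name, score in ranked[:10]:
    print(f"{name:<25} {score:>12.3f}")
```

With pandas available, a similar result should be obtainable from a Series of scores indexed by `X.columns` followed by `nlargest(10)`, which avoids the intermediate concat and rename steps.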
X_new=kidney[selected]
X_new.head()
| | white blood cell count | blood urea | blood glucose random | serum creatinine | packed cell volume | albumin | haemoglobin | age | sugar | ypertension |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7800.0 | 36.0 | 121.0 | 1.2 | 44.0 | 1.0 | 15.4 | 48.0 | 0.0 | 1 |
| 1 | 6000.0 | 18.0 | 121.0 | 0.8 | 38.0 | 4.0 | 11.3 | 7.0 | 0.0 | 0 |
| 2 | 7500.0 | 53.0 | 423.0 | 1.8 | 31.0 | 2.0 | 9.6 | 62.0 | 3.0 | 0 |
| 3 | 6700.0 | 56.0 | 117.0 | 3.8 | 32.0 | 4.0 | 11.2 | 48.0 | 0.0 | 1 |
| 4 | 7300.0 | 26.0 | 106.0 | 1.4 | 35.0 | 2.0 | 11.6 | 51.0 | 0.0 | 0 |
len(X_new)
400
X_new.shape
(400, 10)
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X_new,y,random_state=0,test_size=0.25)
X_train.shape
(300, 10)
y_train.value_counts() # Check for class imbalance
0    188
1    112
Name: class, dtype: int64
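As a quick check on how balanced the split is, the class shares can be computed from the reported counts. This is a small sketch using only numbers that appear in the notebook: the training counts above and the test counts implied by the row sums of the confusion matrix reported later. The helper `class_shares` is hypothetical.

```python
# Counts copied from the notebook outputs (not recomputed here).
train_counts = {0: 188, 1: 112}  # y_train.value_counts()
test_counts = {0: 62, 1: 38}     # row sums of the confusion matrix reported later


def class_shares(counts):
    """Return each label's fraction of the total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


overall_counts = {
    label: train_counts[label] + test_counts[label] for label in train_counts
}

for name, counts in [("train", train_counts),
                     ("test", test_counts),
                     ("overall", overall_counts)]:
    shares = class_shares(counts)
    print(name, {label: round(share, 3) for label, share in shares.items()})
```

Class 1 makes up roughly 37 to 38 percent of each part, so the random split happens to preserve the overall ratio closely. If an exact match is wanted, `train_test_split` accepts a `stratify=y` argument for that purpose.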

!pip install xgboost
Collecting xgboost
  Downloading xgboost-1.6.1-py3-none-win_amd64.whl (125.4 MB)
Requirement already satisfied: numpy in c:\users\admin\anaconda3\lib\site-packages (from xgboost) (1.21.5)
Requirement already satisfied: scipy in c:\users\admin\anaconda3\lib\site-packages (from xgboost) (1.7.3)
Installing collected packages: xgboost
Successfully installed xgboost-1.6.1
from xgboost import XGBClassifier
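# NOTE: XGBoost spells this parameter 'learning_rate' (underscore). The hyphenated
# key used here is not recognised, which is why the fit output below warns that
# "learning-rate" might not be used and the default learning_rate is applied.
# A learning rate of 0 would also prevent the model from learning at all.
params={'learning-rate':[0,0.5,0.20,0.25],
'max_depth':[5,8,10],
'min_child_weight':[1,3,5,7],
'gamma':[0.0,0.1,0.2,0.4],
'colsample_bytree':[0.3,0.4,0.7]}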
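RandomizedSearchCV: Randomized search on hyperparameters. Rather than trying every combination, it samples n_iter parameter settings from param_distributions and evaluates each one with cross-validation. RandomizedSearchCV implements a “fit” and a “score” method. It also implements “score_samples”, “predict”, “predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are implemented in the estimator used.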
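The idea behind randomized search can be sketched without any library: count the combinations in the grid, then draw a handful at random. The grid below mirrors the notebook's dictionary, except that the key is spelled `learning_rate` and its values are illustrative assumptions. The function `sample_candidates` is hypothetical, and unlike scikit-learn (which, as far as I know, samples without replacement when every entry is a list) this simplified version may draw the same combination twice.

```python
import math
import random

# Search space modelled on the notebook's params dictionary.
# 'learning_rate' values are illustrative assumptions, not the author's.
param_grid = {
    "learning_rate": [0.05, 0.10, 0.20, 0.25],
    "max_depth": [5, 8, 10],
    "min_child_weight": [1, 3, 5, 7],
    "gamma": [0.0, 0.1, 0.2, 0.4],
    "colsample_bytree": [0.3, 0.4, 0.7],
}


def sample_candidates(grid, n_iter, seed=0):
    """Draw n_iter random settings, one value per parameter (with replacement)."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in grid.items()}
        for _ in range(n_iter)
    ]


# Size of the full grid: 4 * 3 * 4 * 4 * 3 combinations.
n_grid = math.prod(len(values) for values in param_grid.values())
candidates = sample_candidates(param_grid, n_iter=5)
cv = 5

print(f"exhaustive grid: {n_grid} candidates, {n_grid * cv} fits")
print(f"random search:   {len(candidates)} candidates, {len(candidates) * cv} fits")
for candidate in candidates:
    print(candidate)
```

With n_iter=5 and cv=5 the search performs 25 fits, which matches the "Fitting 5 folds for each of 5 candidates" message in the notebook, against 2880 fits for the exhaustive grid. The trade-off is that only a small fraction of the space is explored.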

from sklearn.model_selection import RandomizedSearchCV
classifier=XGBClassifier()
random_search=RandomizedSearchCV(classifier,param_distributions=params,n_iter=5,scoring='roc_auc',n_jobs=-1,cv=5,verbose=3)
random_search.fit(X_train,y_train)
Fitting 5 folds for each of 5 candidates, totalling 25 fits
[00:09:18] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.6.0/src/learner.cc:627:
Parameters: { "learning-rate" } might not be used.
This could be a false alarm, with some parameters getting used by language bindings but
then being mistakenly passed down to XGBoost core, or some parameter actually being used
but getting flagged wrongly here. Please open an issue if you find any such cases.
RandomizedSearchCV(cv=5,
estimator=XGBClassifier(base_score=None, booster=None,
callbacks=None,
colsample_bylevel=None,
colsample_bynode=None,
colsample_bytree=None,
early_stopping_rounds=None,
enable_categorical=False,
eval_metric=None, gamma=None,
gpu_id=None, grow_policy=None,
importance_type=None,
interaction_constraints=None,
learning_rate=None, max_bin=None,...
monotone_constraints=None,
n_estimators=100, n_jobs=None,
num_parallel_tree=None,
predictor=None, random_state=None,
reg_alpha=None, reg_lambda=None, ...),
n_iter=5, n_jobs=-1,
param_distributions={'colsample_bytree': [0.3, 0.4, 0.7],
'gamma': [0.0, 0.1, 0.2, 0.4],
'learning-rate': [0, 0.5, 0.2, 0.25],
'max_depth': [5, 8, 10],
'min_child_weight': [1, 3, 5, 7]},
scoring='roc_auc', verbose=3)
random_search.best_estimator_ #Checking for best model
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.3,
early_stopping_rounds=None, enable_categorical=False,
eval_metric=None, gamma=0.1, gpu_id=-1, grow_policy='depthwise',
importance_type=None, interaction_constraints='',
learning-rate=0.5, learning_rate=0.300000012, max_bin=256,
max_cat_to_onehot=4, max_delta_step=0, max_depth=10, max_leaves=0,
min_child_weight=1, missing=nan, monotone_constraints='()',
n_estimators=100, n_jobs=0, num_parallel_tree=1, predictor='auto',
random_state=0, reg_alpha=0, ...)
random_search.best_params_
{'min_child_weight': 1,
'max_depth': 10,
'learning-rate': 0.5,
'gamma': 0.1,
'colsample_bytree': 0.3}
classifier=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=0.3, gamma=0.2, gpu_id=-1,
importance_type='gain', interaction_constraints='', learning_rate=0.300000012, max_delta_step=0,
max_depth=5, min_child_weight=1,
monotone_constraints='()', n_estimators=100, n_jobs=8,
num_parallel_tree=1, random_state=0, reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, subsample=1, tree_method='exact',
validate_parameters=1, verbosity=None)
classifier.fit(X_train,y_train)
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.3,
early_stopping_rounds=None, enable_categorical=False,
eval_metric=None, gamma=0.2, gpu_id=-1, grow_policy='depthwise',
importance_type='gain', interaction_constraints='',
learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
max_delta_step=0, max_depth=5, max_leaves=0, min_child_weight=1,
missing=nan, monotone_constraints='()', n_estimators=100,
n_jobs=8, num_parallel_tree=1, predictor='auto', random_state=0,
reg_alpha=0, reg_lambda=1, ...)
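The hand-typed constructor above does not reproduce the search result exactly. A small dictionary comparison, using values copied from the `best_params_` output and from the constructor call, shows where they differ. The variable names `best_params`, `manual_params` and the two result dictionaries are hypothetical.

```python
# Values copied from random_search.best_params_ in the notebook output.
best_params = {
    "min_child_weight": 1,
    "max_depth": 10,
    "learning-rate": 0.5,
    "gamma": 0.1,
    "colsample_bytree": 0.3,
}

# The corresponding values typed by hand into XGBClassifier(...) above.
manual_params = {
    "min_child_weight": 1,
    "max_depth": 5,
    "learning_rate": 0.300000012,
    "gamma": 0.2,
    "colsample_bytree": 0.3,
}

# Parameters present in both but set to different values.
mismatched = {
    name: (best_params[name], manual_params[name])
    for name in best_params
    if name in manual_params and best_params[name] != manual_params[name]
}

# Keys found by the search that have no counterpart in the manual call.
only_in_search = sorted(set(best_params) - set(manual_params))

print("different values (search, manual):", mismatched)
print("only in search result:", only_in_search)
```

The final model therefore uses max_depth=5 and gamma=0.2 rather than the max_depth=10 and gamma=0.1 selected by the search. One way to avoid this kind of drift is to use `random_search.best_estimator_` directly; with the default `refit=True` it has already been refit on the full training data.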
y_pred=classifier.predict(X_test)
y_pred
array([0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,
0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1,
0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1,
1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1])
from sklearn.metrics import confusion_matrix,accuracy_score
confusion_matrix(y_test,y_pred)
array([[61, 1],
[ 0, 38]], dtype=int64)
accuracy_score(y_test,y_pred)
0.99
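Accuracy alone hides which kind of error was made, so it can help to derive the other standard metrics from the confusion matrix above. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels) and treats label 1 as the positive class; which clinical status label 1 stands for depends on the encoding applied earlier in the notebook.

```python
# Confusion matrix copied from the output above.
# Layout (scikit-learn convention): rows = true label, columns = predicted label.
cm = [[61, 1],
      [0, 38]]

# Treating label 1 as the positive class (an assumption).
(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

accuracy = (tp + tn) / total      # share of all test samples classified correctly
sensitivity = tp / (tp + fn)      # recall: share of true label-1 samples found
specificity = tn / (tn + fp)      # share of true label-0 samples found
precision = tp / (tp + fp)        # share of label-1 predictions that are correct

print(f"accuracy:    {accuracy:.4f}")
print(f"sensitivity: {sensitivity:.4f}")
print(f"specificity: {specificity:.4f}")
print(f"precision:   {precision:.4f}")
```

The single misclassified test sample is a true label-0 case predicted as label 1, so whether it counts as a missed diagnosis or a false alarm depends on which status label 0 encodes.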